Pseudo code for creating new datasets to use with the models and for performing
an autoregressive prediction with one of the models.

--- WARNING ---

This document was written a few years after the release of the paper and as
such, several years after the core work in the paper was completed.  When the
paper was released, all the of the training, validation and testing data was
provided, along with the trained models.  The code for a colab was also
included that allowed direct use of the models on any of the data.

What was not provided was a means for generating new data from FARSITE that the
models could also use.  At the time of publication, it was not feasible to
release the code that converts FARSITE data into the formats used by the models
(i.e., the format of the datasets that were provided at release).

This document serves as a high level attempt to provide pseudo code for the
process that could be used to generate new data that the provided models could
work with.

But there are no guarantees provided that following this guide will do so.
These instructions were created by the corresponding author on this work by
looking over the code that was used when generating the datasets, but the
pseudo code was never converted to real code by the author and never directly
tested.  However, it should fill in some of the finer points that could not be
included in the paper, and should help in figuring out how to generate
additional data for the models should you want to do so.

--- File: create_data_snippets.py.snippets ---

We've also included some code snippets from our pipeline that creates data.
This is *not* fully functional code, but it should fill in most of the details
needed.  Our pipeline was written in Python with Apache Beam and ultimately
saved Tensorflor RecordIOs full of tf.train.Example(s), but unless you're
working with large amounts of data, simply hacking out the parts from
ProcessInputFile and getting it to save numpy arrays that you then load in with
the code you're using the model should be fine.

--- Normalization files ---

All of the data stored in the data files used by the models are in 'normalized'
space.  Normalization was performed on each channel such that the resulting
data for each channel of the data has zero mean and unit variance.
Accomplishing that requires knowing the mean and standard deviation of the
data.

Each of the directories that contain data files on Google Cloud provides files
that contain the normalization coefficients used to generate the data in that
directory.  For instance, the data stored for 00 part of the Single Fuel
dataset is located in this location:

wildfire_conv_lstm/data/single_fuel/00

And that directory contains the following file:

norms.txt-00000-of-00001

The contents of that file are:

{
  'wind_east': (4143636000, -1.1608838702946704, 391.4318359179917),
  'wind_north': (4143636000, -0.16826022294298948, 373.6882014037226),
  'moisture_1': (15876000, 21.113, 115.18023825499108),
  'moisture_10': (15876000, 20.664999999999992, 122.30278270362706),
  'moisture_100': (15876000, 21.008, 120.69394360228928),
  'moisture_her': (15876000, 64.017, 389.62473554174403),
  'moisture_woo': (15876000, 64.478, 394.6595408588776),
  'cover': (15876000, 48.163999999999994, 862.4651583250923),
  'height': (15876000, 259.9290000000003, 18534.789126472217),
  'base': (15876000, 116.164, 9683.935713973273),
  'density': (15876000, 19.208, 127.71874404476843),
  'slope_east': (15876000, 0.004536269696326597, 0.11082113729584775),
  'slope_north': (15876000, 0.008847342851617231, 0.10451561303712933)
}

This is a python dictionary containing tuples.  The key contains the name of
the channel, e.g., 'wind_east' is the channel providing the magnitude of the
wind in the east direction.  The corresponding tuple provides:

1) A count of the number of cells in the dataset (this can be ignored).
2) The mean observed across all of the data.
3) The variance observed across all of the data.

Thus, the normalization of each pixel in each channel was done independently via:

pixel_norm = (pixel_unnorm - channel_mean) / (channel_var),

where channel_mean is the 2nd value in each tuple, and channel_var is the 3rd
value in each tuple.

All of the normalization files in each block of files (00 through 09) are the
same.

Notice there is not a normalization entry for the vegetation itself.  The
vegetation data contains value between 0.0 and 1.0 and we did not apply any
normalization to that data.

--- Channels in the data:

0: current vegetation.  Built up from the burn files as described below.
1: previous fire front.  Built up from the burn files as described below.
2: ash.  Built up from the burn files as described below.
3: wind_east.  Loaded.
4: wind_north.  Loaded.
5: moisture_1.  Loaded.
6: moisture_10.  Loaded.
7: moisture_100.  Loaded.
8: moisture_her.  Loaded.
9: moisture_woo.  Loaded.
10: cover.  Loaded.
11: height.  Loaded.
12: base.  Loaded.
13: density.  Loaded.
14: slope_east.  Loaded.
15: slope_north.  Loaded.
16: fuel.  Loaded.

--- Discrete fuel channel ---

The fuel channel contains a discrete id in it.  The code below is a snippet
from our pipeline that contains the specifics of how the fuel channel is set
up.  Basically, every pixel in the fuel channel contains a number between 0 and
len(FUEL_LIST), and the FUEL_LIST specifies which values belong to which fuel
ids.  I.e., all values of 0 in the fuel channel correspond to fuel GR1 (short
grass) with id 101, whereas values of 6 belong to fuel id 107, GR7.

# A list of the fuel types that appear in the data.  These fuel types are mapped
# to a sequence of ints that can be used in one-hot-encoding.
#
# This is a summary of the grouping of the fuel types, taken from page 5 of:
# https://www.fs.fed.us/rm/pubs/rmrs_gtr153.pdf
#
# NB: 90-99    non-burnable
# GR: 100-119  grass, increasing number tends to mean increasing amount
# GS: 120-139  mix of grass and shrub, higher tends to mean more
# SH: 140-159  shrub, higher tends to mean more
# TU: 160-179  grass + shrub + canopy litter, higher tends to mean more litter
# TL: 180-199  mostly canopy litter, higher number tends to mean more
# SB: 200-219  slash-blow down, higher means easier spread rate
#
# More detailed table of the fuel ids.
# s: spread
# f: flame height
# 101: GR1: s:moderate  f:low       short grass, patchy, maybe grazed
# 102: GR2: s:high      f:moderate  coarse continuous grass, 1 foot depth
# 104: GR4: s:very high f:high      coarse continuous grass, 2 foot depth
# 107: GR7: s:very high f:very high coarse continuous grass, 3 feet depth
# 121: GS1: s:moderate  f:low       1 foot shrubs
# 122: GS2: s:high      f:moderate  1-3 foot shrubs
# 123: GS3: s:high      f:moderate  grass/shrub, less than 2 feet
# 124: GS4: s:high      f:very high heavy grass/shrub, more than 2 feet
# 141: SH1: s:very low  f:very low  low shrub fuel load, <1 foot, some grass
# 142: SH2: s:low       f:low       moderate fuel, depth ~1 foot, no grass
# 145: SH5: s:very high f:very high heavy shrub load, 4-6 feet
# 147: SH7: s:high      f:very high
# 149: SH9: s:high      f:very high dense shrubgs w/find dead fuel, 4-6'
# 161: TU1: s:low       f:low       grass and shrub w/ litter
# 162: TU2: s:moderate  f:low       moderate litter, w/ shrubs
# 163: TU3: s:high      4:moderate  moderate litter, w/grass and shrubs
# 164: TU4: s:moderate  f:moderate  short conifer trees w/ grass/moss
# 165: TU5: s:moderate  f:moderate  high load conifer litter, shrubs
# 181: TL1: s:very low  f:very low  low load, hardwood (recently burned)
# 182: TL2: s:very low  f:very low  low load, hardwood (not recent burned)
# 183: TL3: s:very low  f:very low  not broadleaf or long-needle, w/fine fuels
# 184: TL4: s:low       f:low       not broadleaf or long-needle, moderate load
# 185: TL5: s:low       f:low       high load conifer litter, light slash
# 186: TL6: s:moderate  f:low       moderate load, hardwood (not recent burned)
# 187: TL7: s:low       f:low       not broadleaf or long needle, heavy load
# 188: TL8: s:moderate  f:low       long-need pine, moderate, w/herbaceous load
# 189: TL9: s:moderate  f:moderate  v-high load, hardwood (not recently burned)
# 201: SB1: s:moderate  f:low       blowdown1 or activity 1-3"diameter, <1 foot
# 202: SB2: s:moderate  f:moderate  blowdown2 or activity 1-3"diameter, >1 foot
# 203: SB3: s:high      f:high      blowdown3 or activity 0.25"diameter, >1 foot
# 204: SB4: s:very high f:very high blowdown4

FUEL_LIST = [101, 102, 103, 104, 105, 106, 107, 108, 109, 121, 122, 123, 124,
             141, 142, 143, 144, 145, 146, 147, 148, 149, 161, 162, 163, 164,
             165, 181, 182, 183, 184, 185, 186, 187, 188, 189, 201, 202, 203,
             204, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]

--- Process for transforming FARSITE output files into data files. ---

This process is applied independently for each fire sequence.  All fire
sequences used for training had at least 20 time steps, so shorter sequences
should not be used.  We only used patches that were square where H = W.

Have FARSITE save all its output files in a directory: $INPUT_DIR.

$INPUT_DIR should contain:

* Files that provide the numpy array with shape: (H, W, T)
  * H = patch height, W = patch width, T = steps in time series
  * A burn.npy file -- that provides the fractional burned area in each pixel
    at each timestep. This value is computed from the intersection of the fire
    front polygon and the cell. At time T, each pixel is 0.0, and is constant
    or monotonically increasing afterwards to a value of 1.0, corresponding to
    a pixel which has been completely consumed (see burn_fraction.png as an
    example).
  * wind_east.npy file -- provides eastward-blowing component of wind velocity
    for each pixel at each step.
  * wind_north.npy file -- provides northward-blowing component of wind
    velocity for each pixel at each step.
  * Files that provide numpy array with shape: (H, W)
  * base.npy
  * cover.npy
  * density.npy
  * elevation.npy
  * fuel.npy -- The fuel id for each pixel, where each id is a member of
    FUEL_LIST (e.g. 102). This will need to be converted to an index into
    FUEL_LIST.
  * height.npy
  * moisture_100_hour.npy
  * moisture_10_hour.npy
  * moisture_1_hour.npy
  * moisture_live_herbaceous.npy
  * moisture_live_woody.npy
  * slope_east.npy
  * slope_north.npy

Process for prepping data.

* Load the normalization file.
* Load the (H, W, T) numpy array from the burn file.
* Load the (H, W, T) numpy array from the wind files.
* Load the (H, W) from all other channels.
* Convert the fuel channel from raw fuel ids (e.g. 102) to their corresponding
  indices into FUEL_LIST (e.g. 1).
* Normalize all data that needs it (all channels except for the vegetation
  channels).
* Build up the channel 0 (veg), channel 1 (prev_front) and channel 2
  (scar) sequences.
  * veg values all start at 1.0 and stay constant or shrink over time.
  * prev_front values all start at 0.0 and change dynamically over time.
  * scar values all start at 0. and stay constant or grow over time.
  * For each time step:
    * Using the burn.npy file, determine how much vegetation at each cell
      burned away in the current step.  Call that 'front' (every cell has a
      different front value).
    * veg = veg - 'front'
    * prev_front = 'front'
    * scar = scar + 'front'
    * All other channels get their values from the loaded files.  All other
      channels are constant over time, except wind_east and wind_north, which
      are only non-constant in the california_wn dataset.
    * Have the shape of all channels be (1, H, W, 1)
    * All channels created for a single time step then get concatenated
      together into the shape (1, H, W, C), where C = number of channels.
  * Concatenate all of the (1, H, W, C) values together into a (T, H, W, C)
    tensor, which is the data for the entire time sequence wrapped up into a
    single tensor.
  * Save the tensor.

--- Perform autoregressive predictions ---

There are two model types described in the paper.  The EPD model, and the
EPD-ConvLSTM model.

The EPD model takes input of the shape (B, H, W, C), where B = the batch
channel.  To create an autoregressive set of predictions, pick a slice from any
input file that is a single time step, which will have the shape (1, H, W, C),
and simply send that in to the model.  Note that with this shape in the saved
data, the first dimension is the time step, and with this shape going to the
model, it is the batch channel.  The model will return a prediction of shape
(1, H, W) which predicts how much vegetation will be removed from the fire at
this time step.  To construct a subsequent input for the model, simply subtract
the prediction from the first channel (the veg channel), add the prediction to
the third channel (the scar channel) and set the prediction itself as the
second channel (the previous front channel).

The EPD-ConvLSTM model takes an input of the shape (B, T', H, W, C) where T' is
the length of a sequence of time steps.  In our models T' = 8.  To create a
prediction, simply take a slice from the loaded data of size (8, H, W, C), and
then reshape it so it has the shape (1, 8, H, W, C), i.e., make it batch of
inputs with a batch size 1 (you can perform larger batches of course).  Like
the EPD model, the response is still of shape (1, H, W).  Create a new time
step for the sequence the same way you did for the EPD model of shape (1, H, W,
C), and now append that to the sequence you have, resulting in a tensor of size
(1, 9, H, W, C).  The models we used always used exactly 8 time steps, so
remove the first time step from the sequence, resulting again in the shape (1,
8, H, W, C), and send that to the model for the prediction for the next step.


